Active Learning for Logistic Regression
Authors
Abstract
Andrew Ian Schein. Supervisor: Lyle H. Ungar.

Which active learning methods can we expect to yield good performance in learning logistic regression classifiers? Addressing this question is a natural first step toward robust active learning solutions across a wide variety of exponential models, including maximum entropy, generalized linear, loglinear, and conditional random field models. We extend previous work on active learning with explicit objective functions by developing a framework for implementing a wide class of loss functions for active learning of logistic regression, including variance (A-optimality) and log loss reduction. We then compare these against variations of the most widely used heuristic schemes, query by committee and uncertainty sampling, to discover which methods work best for different classes of problems and why.

Our approach to loss functions for active learning borrows from the field of optimal experimental design in statistics. We exploit several properties of nonlinear regression models that allow computation of the variance of a prediction with respect to the model's input distribution. The strategy of minimizing prediction variance is referred to as A-optimality. A Taylor series approximation of many loss functions conveniently factorizes into alternative weightings of this variance computation. We investigate squared and log loss within this framework.

Our empirical evaluation is the largest effort to date to evaluate explicit objective function methods in active learning. We employed ten data sets from domains such as image recognition and document classification. The data sets vary in number of categories from 2 to 26 and have as many as 6,191 predictors.
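As a rough illustration of the variance computation that A-optimality minimizes, the sketch below (a simplified reconstruction, not the thesis's actual code; all function names are ours) uses the delta-method approximation Var[p̂(x)] ≈ (p(1−p))² xᵀF⁻¹x for binary logistic regression, where F is the Fisher information of the labeled set, and greedily scores each pool candidate by the average prediction variance that would result from adding it. The full method additionally takes an expectation over the candidate's unknown label and refits the model, which this sketch omits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_information(X, beta):
    """Fisher information for binary logistic regression:
    F = sum_i p_i (1 - p_i) x_i x_i^T."""
    p = sigmoid(X @ beta)
    w = p * (1.0 - p)
    return (X * w[:, None]).T @ X

def prediction_variance(X_ref, X_train, beta, ridge=1e-6):
    """Delta-method approximation of Var[p_hat(x)] at each reference
    point: (p(1-p))^2 * x^T F^{-1} x.  A small ridge keeps F invertible."""
    F = fisher_information(X_train, beta) + ridge * np.eye(X_train.shape[1])
    F_inv = np.linalg.inv(F)
    p = sigmoid(X_ref @ beta)
    quad = np.einsum('ij,jk,ik->i', X_ref, F_inv, X_ref)  # x^T F^{-1} x per row
    return (p * (1.0 - p)) ** 2 * quad

def a_optimal_query(X_pool, X_train, X_ref, beta):
    """Greedy A-optimal selection: pick the pool point whose addition to
    the design most reduces mean prediction variance over X_ref.
    Simplification: beta is held fixed rather than refit per candidate."""
    best_i, best_score = -1, np.inf
    for i, x in enumerate(X_pool):
        X_aug = np.vstack([X_train, x])
        score = prediction_variance(X_ref, X_aug, beta).mean()
        if score < best_score:
            best_i, best_score = i, score
    return best_i
```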
This work establishes the benefits of these often-cited (but rarely used) strategies, and counters the claim that experimental design methods are too computationally complex to run on interesting data sets. The two loss functions were the only methods we tested that always performed at least as well as a randomly selected training set.

The same data were used to evaluate several heuristic methods, including uncertainty sampling, heuristic variants of the query by committee method, and a method that maximizes classifier certainty. Uncertainty sampling was tested with two measures of uncertainty: Shannon entropy and margin size. Margin-based uncertainty sampling proved superior; however, both variants perform worse than random sampling at times. We show that these failures to match random sampling can be caused by predictor-space regions of varying noise or by model mismatch. The heuristics produced mixed results overall, and it is impossible to single one out as better than the others when classifier accuracy is the sole performance criterion. Margin sampling is the favored approach when computational time is considered along with accuracy.
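The two uncertainty measures compared above are simple to state concretely. A minimal sketch (function names are ours) scores each pool example from its vector of predicted class probabilities: Shannon entropy rewards diffuse distributions, while margin size looks only at the gap between the two most probable classes.

```python
import numpy as np

def entropy_uncertainty(probs):
    """Shannon entropy of each row of class probabilities
    (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def margin_uncertainty(probs):
    """Negative margin between the two most probable classes
    (small margin = high uncertainty, so negate for argmax)."""
    s = np.sort(probs, axis=1)
    return -(s[:, -1] - s[:, -2])

def select_query(probs, measure=margin_uncertainty):
    """Index of the most uncertain pool example under the given measure."""
    return int(np.argmax(measure(probs)))
```

Both measures coincide for binary problems but can rank multiclass examples differently, which is one reason the empirical comparison above treats them separately.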
Similar Articles
A Benchmark and Comparison of Active Learning for Logistic Regression
Various active learning methods based on logistic regression have been proposed. In this paper, we investigate seven state-of-the-art strategies, present an extensive benchmark, and provide a better understanding of their underlying characteristics. Experiments are carried out on both 3 synthetic datasets and 43 real-world datasets, providing insights into the behaviour of these active learning...
Active Learning for Multi-Class Logistic Regression
Which of the many proposed methods for active learning can we expect to yield good performance in learning logistic regression classifiers? In this article, we evaluate different approaches to determine suitable practices. Among our contributions, we test several explicit objective functions for active learning: an empirical consideration lacking in the literature until this point. We develop a...
A-Optimality for Active Learning of Logistic Regression Classifiers
Over the last decade there has been growing interest in pool-based active learning techniques, where instead of receiving an i.i.d. sample from a pool of unlabeled data, a learner may take an active role in selecting examples from the pool. Queries to an oracle (a human annotator in most applications) provide label information for the selected observations, but at a cost. The challenge is to en...
Sample size determination for logistic regression
The problem of sample size estimation is important in medical applications, especially in cases of expensive measurements of immune biomarkers. This paper describes the problem of logistic regression analysis with sample size determination algorithms, namely the methods of univariate statistics, logistic regression, cross-validation and Bayesian inference. The authors, treating the regr...
Active Learning with Rationales for Text Classification
We present a simple and yet effective approach that can incorporate rationales elicited from annotators into the training of any off-the-shelf classifier. We show that our simple approach is effective for multinomial naïve Bayes, logistic regression, and support vector machines. We additionally present an active learning method tailored specifically for the learning with rationales framework.
Hyperspectral segmentation with active learning
This paper introduces a new supervised Bayesian approach to hyperspectral image segmentation, with two main steps: (a) learning, for each class label, the posterior probability distributions, based on a multinomial logistic regression model; (b) segmenting the hyperspectral image, based on the posterior probability distribution learnt in step (a) and on a multi-level logistic prior encoding the...